This is the cleaned version of my regression analysis for my STA302 final project; my old Markdown file had accumulated too much busywork. I will document my data and create graphs here, for organization's sake, while writing my final report.
We will work with two datasets: savant_data.csv, which contains the bulk of our working data, courtesy of Major League Baseball's Statcast, and botcmd.csv, which contains command data courtesy of Cameron Grove's PitchingBot model on FanGraphs.
Let’s begin by loading in the necessary libraries and cleaning our data.
# Libraries
library(car)
library(tidyverse)
# Loading and cleaning data
fb <- read_csv("savant_data.csv")
fb <- fb %>% select(player_id, total, `downward movement w/ gravity (in)`,
`glove/arm-side movement (in)`, whiff_percent, `pitch (MPH)`,
`spin (RPM)`, `vertical release pt (ft)`, `extension (ft)`)
bot <- read_csv("botcmd.csv") %>% select(player_id, `botCmd FA`)
# Merging and cleaning data
fastballs <- merge(fb, bot, by = 'player_id')
renamed_cols <- c('downwards_mvmt' = 'downward movement w/ gravity (in)',
'side_mvmt' = 'glove/arm-side movement (in)',
'speed' = 'pitch (MPH)',
'spin' = 'spin (RPM)',
'release_pt' = 'vertical release pt (ft)',
'extension' = 'extension (ft)',
'command' = 'botCmd FA')
fastballs <- rename(fastballs, all_of(renamed_cols))
Now let's preview our data:
head(fastballs)
## player_id total downwards_mvmt side_mvmt whiff_percent speed spin release_pt
## 1 425844 2219 18.2 2.9 15.9 89.5 2248 6.29
## 2 434378 2602 12.2 8.0 18.1 94.3 2428 7.03
## 3 448179 2531 17.6 6.3 21.8 88.4 2203 5.91
## 4 450203 2842 19.5 13.3 22.5 94.9 2307 5.45
## 5 453286 2465 15.5 10.8 23.8 93.7 2360 5.48
## 6 458681 3167 18.4 3.2 31.6 92.4 2424 5.64
## extension command
## 1 6.0 62
## 2 6.0 49
## 3 6.3 49
## 4 6.2 44
## 5 6.3 59
## 6 6.5 56
Now we will begin training different models. We will split our data 80-20 into training and validation sets.
set.seed(1000)
fastballs_training <- fastballs %>% sample_frac(size = 0.8)
fastballs_validation <- anti_join(fastballs, fastballs_training, by = 'player_id')
Let's start with some quick exploratory data analysis (EDA):
summarize(fastballs_training, pitchers = n(),
`min speed` = min(speed), `max speed` = max(speed),
`average speed` = mean(speed), `median speed` = median(speed),
`sd speed` = sd(speed))
## pitchers min speed max speed average speed median speed sd speed
## 1 120 88.4 99.1 94.33083 94.5 2.140753
We will proceed by backwards elimination, so we will create our first model using every possible parameter.
reg_all <- lm(whiff_percent ~ speed + spin + downwards_mvmt +
side_mvmt + release_pt + extension + command,
data = fastballs_training)
summary(reg_all)
##
## Call:
## lm(formula = whiff_percent ~ speed + spin + downwards_mvmt +
## side_mvmt + release_pt + extension + command, data = fastballs_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.2833 -2.3264 0.0322 2.8411 10.6880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.141909 28.986523 -0.281 0.779316
## speed 0.401817 0.238051 1.688 0.094205 .
## spin 0.007566 0.002817 2.686 0.008339 **
## downwards_mvmt -0.544637 0.216371 -2.517 0.013246 *
## side_mvmt 0.263241 0.131540 2.001 0.047787 *
## release_pt -4.211420 1.094624 -3.847 0.000199 ***
## extension 0.786705 0.941384 0.836 0.405109
## command 0.008784 0.049753 0.177 0.860179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.069 on 112 degrees of freedom
## Multiple R-squared: 0.3804, Adjusted R-squared: 0.3416
## F-statistic: 9.821 on 7 and 112 DF, p-value: 1.675e-09
Let's check the assumptions of normality, linearity, and constant variance. Let's also check multicollinearity by looking at the Variance Inflation Factor (VIF):
vif(reg_all)
## speed spin downwards_mvmt side_mvmt release_pt
## 1.866332 1.185329 2.117692 1.380988 1.595473
## extension command
## 1.114783 1.046809
Based on the above, it seems there may be some outliers. Every assumption appears to be satisfied except, potentially, linearity, which we might address through a transformation. The VIFs do indicate some degree of multicollinearity (particularly for the first five predictors, which sit notably above 1), but nothing that needs to be addressed, since every VIF \(< 5\).
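For reference, each VIF is computed from the \(R^2\) of regressing that predictor on all of the others:

\[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \]

so a VIF of \(5\) corresponds to \(R_j^2 = 0.8\); none of our predictors comes close to that.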
There are two candidate transformations:

1. Use Box-Cox to decide on a power transformation.
2. Use the arcsine variance-stabilizing transformation, since our response is a percentage.
Let’s see how both perform. First, to decide on a power transformation:
MASS::boxcox(reg_all)
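As a reminder, Box-Cox searches over the power family

\[ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\ \log y, & \lambda = 0, \end{cases} \]

choosing \(\lambda\) by maximum likelihood; \(\lambda = 1\) corresponds to leaving the response untransformed.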
Based on the Box-Cox plot, since \(1\) is closest to the maximum likelihood estimate of \(\lambda\), no power transformation seems necessary. Let's proceed with the arcsine transformation and see if it's any help.
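The motivation for the arcsine transformation is a standard variance-stabilizing result: for a proportion \(\hat{p}\) estimated from \(n\) trials,

\[ y' = \arcsin\sqrt{\hat{p}} \quad \text{has} \quad \operatorname{Var}(y') \approx \frac{1}{4n}, \]

approximately free of \(p\) itself. Since whiff_percent is a percentage, we divide by \(100\) first so the values land in \([0, 1]\).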
# Arcsine transformed response
asine_whiff_percent <- asin(sqrt(fastballs_training$whiff_percent / 100))
# Transformed regression
asine_reg_all <- lm(asine_whiff_percent ~ speed + spin + downwards_mvmt +
side_mvmt + release_pt + extension + command,
data = fastballs_training)
summary(asine_reg_all)
##
## Call:
## lm(formula = asine_whiff_percent ~ speed + spin + downwards_mvmt +
## side_mvmt + release_pt + extension + command, data = fastballs_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.169895 -0.027220 -0.000181 0.035785 0.128632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.509e-02 3.608e-01 0.264 0.792590
## speed 5.152e-03 2.963e-03 1.739 0.084795 .
## spin 8.907e-05 3.506e-05 2.540 0.012451 *
## downwards_mvmt -6.634e-03 2.693e-03 -2.463 0.015282 *
## side_mvmt 3.403e-03 1.637e-03 2.078 0.039969 *
## release_pt -5.011e-02 1.362e-02 -3.678 0.000363 ***
## extension 9.376e-03 1.172e-02 0.800 0.425302
## command 8.799e-05 6.193e-04 0.142 0.887265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05065 on 112 degrees of freedom
## Multiple R-squared: 0.3715, Adjusted R-squared: 0.3322
## F-statistic: 9.457 on 7 and 112 DF, p-value: 3.511e-09
par(mfrow = c(1,2))
plot(asine_reg_all)
plot(reg_all)
The transformed fit actually looks a little worse. Either way, point 27, associated with Shane Bieber, is the worst performer in both fits. Let's look at his standardized residual:
rstandard(reg_all)[27]
## 27
## -3.105265
rstandard(asine_reg_all)[27]
## 27
## -3.450755
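Recall that the standardized residual scales each raw residual by its estimated standard deviation,

\[ r_i = \frac{e_i}{s\sqrt{1 - h_{ii}}}, \]

where \(h_{ii}\) is the leverage; a common rule of thumb flags observations with \(|r_i| > 3\). Bieber sits right around that cutoff in both fits.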
At this range, whether or not we remove him seems to be a bit of a judgement call. Let’s try removing him and see what happens.
updated_fastballs_training <- fastballs_training %>% filter(player_id != 669456)
reg_all_removed <- lm(whiff_percent ~ speed + spin + downwards_mvmt +
side_mvmt + release_pt + extension + command,
data = updated_fastballs_training)
summary(reg_all_removed)
##
## Call:
## lm(formula = whiff_percent ~ speed + spin + downwards_mvmt +
## side_mvmt + release_pt + extension + command, data = updated_fastballs_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1095 -2.3807 -0.0609 2.6417 10.7738
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.480854 28.066282 0.124 0.90152
## speed 0.268805 0.232267 1.157 0.24963
## spin 0.007614 0.002705 2.814 0.00578 **
## downwards_mvmt -0.605074 0.208616 -2.900 0.00449 **
## side_mvmt 0.314302 0.127298 2.469 0.01507 *
## release_pt -4.273711 1.051322 -4.065 9e-05 ***
## extension 0.958623 0.905555 1.059 0.29208
## command 0.021236 0.047932 0.443 0.65860
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.908 on 111 degrees of freedom
## Multiple R-squared: 0.4014, Adjusted R-squared: 0.3636
## F-statistic: 10.63 on 7 and 111 DF, p-value: 3.482e-10
par(mfrow = c(1,2))
plot(reg_all_removed)
# Arcsine version
arcsine_reg_all_removed <- lm(asin(sqrt(updated_fastballs_training$whiff_percent / 100)) ~
speed + spin + downwards_mvmt + side_mvmt + release_pt +
extension + command, data = updated_fastballs_training)
summary(arcsine_reg_all_removed)
##
## Call:
## lm(formula = asin(sqrt(updated_fastballs_training$whiff_percent/100)) ~
## speed + spin + downwards_mvmt + side_mvmt + release_pt +
## extension + command, data = updated_fastballs_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.135826 -0.028609 -0.000492 0.032223 0.129819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.559e-01 3.454e-01 0.741 0.460468
## speed 3.313e-03 2.859e-03 1.159 0.249048
## spin 8.973e-05 3.330e-05 2.695 0.008141 **
## downwards_mvmt -7.470e-03 2.568e-03 -2.909 0.004376 **
## side_mvmt 4.109e-03 1.567e-03 2.622 0.009956 **
## release_pt -5.097e-02 1.294e-02 -3.939 0.000143 ***
## extension 1.175e-02 1.115e-02 1.055 0.293925
## command 2.602e-04 5.900e-04 0.441 0.660018
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0481 on 111 degrees of freedom
## Multiple R-squared: 0.3975, Adjusted R-squared: 0.3595
## F-statistic: 10.46 on 7 and 111 DF, p-value: 4.861e-10
par(mfrow = c(2,2))
plot(arcsine_reg_all_removed)
Normality looks better in both cases, although the non-transformed fit still looks best. Speed is no longer significant, while downwards movement and side movement have become more significant. Let's proceed with the non-transformed dataset with Bieber removed.
Among the predictors that remain significant, side movement is the weakest. Let's try dropping it along with the insignificant predictors (speed, extension, and command).
reg_sdr <- lm(whiff_percent ~ spin + downwards_mvmt + release_pt,
data = updated_fastballs_training)
summary(reg_sdr)
##
## Call:
## lm(formula = whiff_percent ~ spin + downwards_mvmt + release_pt,
## data = updated_fastballs_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6167 -2.4733 0.1696 2.3800 10.6984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.451986 10.969020 4.782 5.19e-06 ***
## spin 0.006236 0.002726 2.287 0.024 *
## downwards_mvmt -0.793945 0.170495 -4.657 8.69e-06 ***
## release_pt -5.609603 0.985721 -5.691 9.80e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.049 on 115 degrees of freedom
## Multiple R-squared: 0.334, Adjusted R-squared: 0.3167
## F-statistic: 19.23 on 3 and 115 DF, p-value: 3.571e-10
Was this okay? Let’s validate with the partial \(F\) test.
anova(reg_all_removed, reg_sdr)
## Analysis of Variance Table
##
## Model 1: whiff_percent ~ speed + spin + downwards_mvmt + side_mvmt + release_pt +
## extension + command
## Model 2: whiff_percent ~ spin + downwards_mvmt + release_pt
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 111 1694.9
## 2 115 1885.7 -4 -190.74 3.1229 0.01778 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Even this removal was significant at the 5% level (\(p \approx 0.018\)), so no, we cannot drop side movement.
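As a sanity check, the partial \(F\) statistic above can be reproduced by hand from the printed residual sums of squares:

```r
# Partial F: ((RSS_reduced - RSS_full) / df_difference) / (RSS_full / df_full),
# using the values printed in the anova() table above
rss_full <- 1694.9; df_full <- 111  # 7-predictor model
rss_red  <- 1885.7; df_red  <- 115  # 3-predictor model
f_stat <- ((rss_red - rss_full) / (df_red - df_full)) / (rss_full / df_full)
f_stat  # ~3.12, matching the table's F value
```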
Let's instead remove only the insignificant predictors: speed, extension, and command.
reg_sdsr <- lm(whiff_percent ~ spin + downwards_mvmt + side_mvmt +
release_pt, data = updated_fastballs_training)
summary(reg_sdsr)
##
## Call:
## lm(formula = whiff_percent ~ spin + downwards_mvmt + side_mvmt +
## release_pt, data = updated_fastballs_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4660 -2.2224 -0.0577 2.7538 11.0474
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.483699 11.034128 3.850 0.000195 ***
## spin 0.007335 0.002650 2.768 0.006586 **
## downwards_mvmt -0.786147 0.164290 -4.785 5.16e-06 ***
## side_mvmt 0.353517 0.112469 3.143 0.002130 **
## release_pt -4.811014 0.983130 -4.894 3.28e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.901 on 114 degrees of freedom
## Multiple R-squared: 0.3871, Adjusted R-squared: 0.3656
## F-statistic: 18 on 4 and 114 DF, p-value: 1.747e-11
Let’s confirm if we can remove speed, extension, and command using a partial \(F\) test.
anova(reg_all_removed, reg_sdsr)
## Analysis of Variance Table
##
## Model 1: whiff_percent ~ speed + spin + downwards_mvmt + side_mvmt + release_pt +
## extension + command
## Model 2: whiff_percent ~ spin + downwards_mvmt + side_mvmt + release_pt
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 111 1694.9
## 2 114 1735.3 -3 -40.348 0.8808 0.4534
Since \(p = 0.4534\), it looks like
we are good to go. Let’s recheck our assumptions:
Normality still looks good, although there seems to be some fanning behavior in the residuals. Let’s re-introduce the arcsine transformation and see if it’s any help this time around.
# Arcsine version
arcsine_reg_sdsr_removed <- lm(asin(sqrt(updated_fastballs_training$whiff_percent / 100)) ~
spin + downwards_mvmt + side_mvmt + release_pt,
data = updated_fastballs_training)
summary(arcsine_reg_sdsr_removed)
##
## Call:
## lm(formula = asin(sqrt(updated_fastballs_training$whiff_percent/100)) ~
## spin + downwards_mvmt + side_mvmt + release_pt, data = updated_fastballs_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.140190 -0.025379 0.001613 0.032034 0.133183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.358e-01 1.358e-01 5.418 3.40e-07 ***
## spin 8.633e-05 3.262e-05 2.647 0.00928 **
## downwards_mvmt -9.699e-03 2.022e-03 -4.797 4.92e-06 ***
## side_mvmt 4.594e-03 1.384e-03 3.319 0.00122 **
## release_pt -5.757e-02 1.210e-02 -4.758 5.78e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04802 on 114 degrees of freedom
## Multiple R-squared: 0.3832, Adjusted R-squared: 0.3616
## F-statistic: 17.71 on 4 and 114 DF, p-value: 2.488e-11
par(mfrow = c(1,2))
plot(arcsine_reg_sdsr_removed)
plot(arcsine_reg_sdsr_removed$residuals ~ updated_fastballs_training$spin,
main = 'Spin',
xlab = 'Spin (RPM)', ylab = 'Residuals',
col = 'orange')
plot(arcsine_reg_sdsr_removed$residuals ~ updated_fastballs_training$downwards_mvmt,
main = 'Downwards Movement',
xlab = 'Downwards Movement (in)', ylab = 'Residuals',
col = 'goldenrod')
plot(arcsine_reg_sdsr_removed$residuals ~ updated_fastballs_training$side_mvmt,
main = 'Side Movement',
xlab = 'Side Movement (in)', ylab = 'Residuals',
col = 'darkseagreen')
plot(arcsine_reg_sdsr_removed$residuals ~ updated_fastballs_training$release_pt,
main = 'Release Point',
xlab = 'Release Point (ft)', ylab = 'Residuals',
col = 'cadetblue')
If anything, it looks worse. How about weighted least squares (WLS)?
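The usual two-stage WLS recipe estimates the error standard deviation by regressing the absolute residuals on the fitted values, then refits with inverse-variance weights. A self-contained sketch of the idea on simulated heteroscedastic data (all names and numbers here are illustrative, not from the fastball dataset):

```r
set.seed(1)
x <- runif(200, 0, 10)
y <- 2 + 3 * x + rnorm(200, sd = 1 + 0.5 * x)  # error SD grows with x

ols <- lm(y ~ x)
# Stage 1: model the error SD as a function of the fitted values
sd_hat <- lm(abs(ols$residuals) ~ ols$fitted.values)$fitted.values
# Stage 2: refit with inverse-variance weights
wls <- lm(y ~ x, weights = 1 / sd_hat^2)
coef(wls)  # slope estimate should stay close to the true value of 3
```

The same recipe is applied to our fastball model below.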
weights <- 1 / lm(abs(reg_sdsr$residuals) ~ reg_sdsr$fitted.values)$fitted.values^2
wls_reg_sdsr <- lm(whiff_percent ~ spin + downwards_mvmt + side_mvmt + release_pt,
data = updated_fastballs_training, weights = weights)
summary(wls_reg_sdsr)
##
## Call:
## lm(formula = whiff_percent ~ spin + downwards_mvmt + side_mvmt +
## release_pt, data = updated_fastballs_training, weights = weights)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -3.4502 -0.7408 0.0046 0.9206 3.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.332631 11.089197 3.998 0.000114 ***
## spin 0.007119 0.002647 2.689 0.008241 **
## downwards_mvmt -0.805156 0.166424 -4.838 4.14e-06 ***
## side_mvmt 0.345446 0.112552 3.069 0.002683 **
## release_pt -4.984196 0.976764 -5.103 1.35e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.292 on 114 degrees of freedom
## Multiple R-squared: 0.3927, Adjusted R-squared: 0.3714
## F-statistic: 18.43 on 4 and 114 DF, p-value: 1.049e-11
It looks like this model does the best job of satisfying the assumptions. Let's check the last one: multicollinearity.
vif(wls_reg_sdsr)
## spin downwards_mvmt side_mvmt release_pt
## 1.153141 1.353217 1.096323 1.425529
The VIFs remain well below \(5\), so multicollinearity is still not a concern.
Can we go simpler? Let's keep only the two most significant predictors.
reg_dr <- lm(whiff_percent ~ downwards_mvmt + release_pt,
data = updated_fastballs_training)
summary(reg_dr)
##
## Call:
## lm(formula = whiff_percent ~ downwards_mvmt + release_pt, data = updated_fastballs_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7028 -2.5705 0.0924 2.5525 10.2306
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.8809 7.0658 10.173 < 2e-16 ***
## downwards_mvmt -0.9084 0.1659 -5.475 2.57e-07 ***
## release_pt -6.2082 0.9675 -6.417 3.15e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.123 on 116 degrees of freedom
## Multiple R-squared: 0.3037, Adjusted R-squared: 0.2917
## F-statistic: 25.3 on 2 and 116 DF, p-value: 7.603e-10
Now let’s conduct a partial F test to see if this was okay.
anova(wls_reg_sdsr, reg_dr)
## Analysis of Variance Table
##
## Model 1: whiff_percent ~ spin + downwards_mvmt + side_mvmt + release_pt
## Model 2: whiff_percent ~ downwards_mvmt + release_pt
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 114 190.35
## 2 116 1971.45 -2 -1781.1 533.34 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is highly significant, so it appears at least one of spin and side movement matters. (A caveat: anova() here compares a weighted fit against an unweighted one, so the two residual sums of squares are on different scales and the \(F\) statistic is inflated; comparing the OLS fits, e.g. anova(reg_sdsr, reg_dr), would be the cleaner test.) Let's take them up individually.
reg_sdr <- lm(whiff_percent ~ spin + downwards_mvmt + release_pt,
data = updated_fastballs_training)
summary(reg_sdr)
##
## Call:
## lm(formula = whiff_percent ~ spin + downwards_mvmt + release_pt,
## data = updated_fastballs_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6167 -2.4733 0.1696 2.3800 10.6984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.451986 10.969020 4.782 5.19e-06 ***
## spin 0.006236 0.002726 2.287 0.024 *
## downwards_mvmt -0.793945 0.170495 -4.657 8.69e-06 ***
## release_pt -5.609603 0.985721 -5.691 9.80e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.049 on 115 degrees of freedom
## Multiple R-squared: 0.334, Adjusted R-squared: 0.3167
## F-statistic: 19.23 on 3 and 115 DF, p-value: 3.571e-10
anova(wls_reg_sdsr, reg_sdr)
## Analysis of Variance Table
##
## Model 1: whiff_percent ~ spin + downwards_mvmt + side_mvmt + release_pt
## Model 2: whiff_percent ~ spin + downwards_mvmt + release_pt
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 114 190.35
## 2 115 1885.66 -1 -1695.3 1015.3 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So side movement is important. What if we remove spin instead?
reg_dsr <- lm(whiff_percent ~ downwards_mvmt + side_mvmt + release_pt,
data = updated_fastballs_training)
summary(reg_dsr)
##
## Call:
## lm(formula = whiff_percent ~ downwards_mvmt + side_mvmt + release_pt,
## data = updated_fastballs_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6596 -2.3802 -0.1801 2.7471 10.4662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.0962 7.1981 9.182 2.08e-15 ***
## downwards_mvmt -0.9194 0.1616 -5.690 9.83e-08 ***
## side_mvmt 0.3125 0.1147 2.725 0.00744 **
## release_pt -5.5956 0.9683 -5.779 6.54e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.013 on 115 degrees of freedom
## Multiple R-squared: 0.346, Adjusted R-squared: 0.3289
## F-statistic: 20.28 on 3 and 115 DF, p-value: 1.285e-10
anova(wls_reg_sdsr, reg_dsr)
This too is significant. So it seems we cannot go simpler.
Let's now compare our original model to our simplified model. We will use corrected AIC (appropriate here since \(n = 119\) and the full model has \(K = 9\) estimated parameters, giving \(\frac{n}{K} \approx 13 < 40\)), BIC, and adjusted \(R^2\) to decide.
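The correction applied below is

\[ \mathrm{AIC}_c = \mathrm{AIC} + \frac{2K(K+1)}{n - K - 1}, \]

where \(K\) counts all estimated parameters (regression coefficients plus the error variance), so \(K = 9\) for the full model and \(K = 6\) for the simplified one.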
models = c('All', 'Spin + Downwards Movement + Side Movement + Release Point')
n_predictors = c(7, 4)
adj_r_squared = c(summary(reg_all_removed)$adj.r.squared,
summary(wls_reg_sdsr)$adj.r.squared)
bic = c(BIC(reg_all_removed),
BIC(wls_reg_sdsr))
# AICc = AIC + 2K(K+1)/(n - K - 1), with K = 9 and K = 6 respectively
aic_corrected = c(AIC(reg_all_removed) + (2*9*10)/(119-9-1),
AIC(wls_reg_sdsr) + (2*6*7)/(119-6-1))
selection <- tibble(models, n_predictors, adj_r_squared, aic_corrected, bic)
selection
## # A tibble: 2 × 5
##   models                                           n_pre…¹ adj_r…² aic_c…³   bic
##   <chr>                                              <dbl>   <dbl>   <dbl> <dbl>
## 1 All                                                    7   0.364    673.  697.
## 2 Spin + Downwards Movement + Side Movement + Rel…       4   0.371    669.  685.
## # … with abbreviated variable names ¹n_predictors, ²adj_r_squared,
## #   ³aic_corrected
So by all counts, our second model is better. We'll pick that one and move on to our final step: validation.
reg_sdsr_validation <- lm(whiff_percent ~ spin + downwards_mvmt + side_mvmt + release_pt,
data = fastballs_validation)
# summary(reg_sdsr_validation)
weights_val <- 1 / lm(abs(reg_sdsr_validation$residuals) ~ reg_sdsr_validation$fitted.values)$fitted.values^2
wls_reg_sdsr_val <- lm(whiff_percent ~ spin + downwards_mvmt + side_mvmt + release_pt,
data = fastballs_validation, weights = weights_val)
summary(wls_reg_sdsr_val)
##
## Call:
## lm(formula = whiff_percent ~ spin + downwards_mvmt + side_mvmt +
## release_pt, data = fastballs_validation, weights = weights_val)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -2.12604 -0.94320 0.05419 0.89935 3.01236
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.387066 30.152078 2.301 0.02999 *
## spin 0.005538 0.008182 0.677 0.50473
## downwards_mvmt -1.119305 0.463606 -2.414 0.02341 *
## side_mvmt -0.134745 0.329510 -0.409 0.68608
## release_pt -7.124670 2.508304 -2.840 0.00883 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.325 on 25 degrees of freedom
## Multiple R-squared: 0.3663, Adjusted R-squared: 0.2649
## F-statistic: 3.612 on 4 and 25 DF, p-value: 0.01863
plot(wls_reg_sdsr_val)
par(mfrow = c(1,2))
plot(wls_reg_sdsr_val$residuals ~ fastballs_validation$spin,
main = 'Spin',
xlab = 'Spin (RPM)', ylab = 'Residuals',
col = 'orange')
plot(wls_reg_sdsr_val$residuals ~ fastballs_validation$downwards_mvmt,
main = 'Downwards Movement',
xlab = 'Downwards Movement (in)', ylab = 'Residuals',
col = 'goldenrod')
plot(wls_reg_sdsr_val$residuals ~ fastballs_validation$side_mvmt,
main = 'Side Movement',
xlab = 'Side Movement (in)', ylab = 'Residuals',
col = 'darkseagreen')
plot(wls_reg_sdsr_val$residuals ~ fastballs_validation$release_pt,
main = 'Release Point',
xlab = 'Release Point (ft)', ylab = 'Residuals',
col = 'cadetblue')
# plot(reg_sdsr_validation)
It looks like point 15 (Felix Bautista) has high leverage in this dataset, with points 12 and 2 close behind. There also appear to be quite a few meaningful deviations from normality; in particular, points 17 (Alex Faedo) and 15 (Felix Bautista) are far off.
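A generic way to flag high-leverage points is to compare the hat values against the usual cutoff of \(2(p+1)/n\); a self-contained sketch on toy data (the data here is illustrative, not from the validation set):

```r
set.seed(42)
toy <- data.frame(x = c(rnorm(29), 8), y = rnorm(30))  # one deliberately extreme x
fit <- lm(y ~ x, data = toy)
high <- which(hatvalues(fit) > 2 * length(coef(fit)) / nrow(toy))
high  # the extreme point (row 30) gets flagged
```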
Let's compare the validation fit with our training fit:
models = c('Training', 'Validation')
n = c(119, 30)
adj_r_squared = c(summary(wls_reg_sdsr)$adj.r.squared,
summary(wls_reg_sdsr_val)$adj.r.squared)
intercept_estimate <- c(wls_reg_sdsr$coefficients[1],
wls_reg_sdsr_val$coefficients[1])
spin_estimate <- c(wls_reg_sdsr$coefficients[2],
wls_reg_sdsr_val$coefficients[2])
downwards_mvmt_estimate <- c(wls_reg_sdsr$coefficients[3],
wls_reg_sdsr_val$coefficients[3])
side_mvmt_estimate <- c(wls_reg_sdsr$coefficients[4],
wls_reg_sdsr_val$coefficients[4])
release_pt_estimate <- c(wls_reg_sdsr$coefficients[5],
wls_reg_sdsr_val$coefficients[5])
validation <- tibble(models, n, adj_r_squared, spin_estimate, downwards_mvmt_estimate,
side_mvmt_estimate, release_pt_estimate)
validation
## # A tibble: 2 × 7
## models n adj_r_squared spin_estimate downwards_mvmt_…¹ side_…² relea…³
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Training 119 0.371 0.00712 -0.805 0.345 -4.98
## 2 Validation 30 0.265 0.00554 -1.12 -0.135 -7.12
## # … with abbreviated variable names ¹downwards_mvmt_estimate,
## # ²side_mvmt_estimate, ³release_pt_estimate
So besides the spin estimate, the coefficients really aren't too similar. Let's also check multicollinearity, which so far has never been an issue.
vif(wls_reg_sdsr_val)
## spin downwards_mvmt side_mvmt release_pt
## 1.112659 1.394939 1.649884 1.980719
Finally, let's compare the response summaries of the validation and training datasets:
summarize(fastballs_validation, pitchers = n(),
`min whiff` = min(whiff_percent), `max whiff` = max(whiff_percent),
`average whiff` = mean(whiff_percent), `median whiff` = median(whiff_percent),
`sd whiff` = sd(whiff_percent))
## pitchers min whiff max whiff average whiff median whiff sd whiff
## 1 30 12.4 37.8 22.86667 21.15 6.428725
summarize(updated_fastballs_training, pitchers = n(),
`min whiff` = min(whiff_percent), `max whiff` = max(whiff_percent),
`average whiff` = mean(whiff_percent), `median whiff` = median(whiff_percent),
`sd whiff` = sd(whiff_percent))
## pitchers min whiff max whiff average whiff median whiff sd whiff
## 1 119 9.7 31.5 22.06303 21.7 4.89851
Let’s also look at the number of total fastballs thrown by each group.
mean(updated_fastballs_training$total)
## [1] 1981.277
mean(fastballs_validation$total)
## [1] 1884.667
It looks like the validation group threw, on average, about 100 fewer fastballs than the training group. This might explain some of the deviations.